
[Core] fix gRPC handlers' unlimited active calls configuration #25626

Merged 3 commits into ray-project:master on Jun 10, 2022

Conversation

@mwtian (Member) commented Jun 9, 2022

Why are these changes needed?

Ray's gRPC server wrapper configures a max active call setting for each handler. When the max active call is -1, the handler is supposed to allow handling an unlimited number of requests concurrently. However, in practice it is often observed that handlers configured with unlimited active calls still handle at most 100 requests concurrently.
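
As a rough sketch of the intended semantics only (the names kUnlimited and EffectiveMaxActiveCalls below are illustrative, not Ray's actual identifiers): -1 is a sentinel meaning the handler's concurrency should be uncapped.

#include <iostream>
#include <limits>

constexpr int kUnlimited = -1;  // hypothetical sentinel, mirroring the -1 setting

// Hypothetical helper: translate the configured value into an effective cap.
long EffectiveMaxActiveCalls(int configured) {
  return configured == kUnlimited ? std::numeric_limits<long>::max() : configured;
}

int main() {
  std::cout << EffectiveMaxActiveCalls(100) << "\n";         // capped handler
  std::cout << EffectiveMaxActiveCalls(kUnlimited) << "\n";  // "unlimited" handler
}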

This is a result of the existing logic:

  1. At a high level, each gRPC method is associated with a number of ServerCall objects (acting as "tags") in the gRPC completion queue. When there is no tag for a method, the gRPC server thread cannot poll requests for that method from the completion queue. After a request is polled from the completion queue, it is processed by the polling gRPC server thread, then queued to an eventloop.
  2. When a handler is in the "unlimited" mode, it creates a new ServerCall object (tag) before the actual processing. The problem is that the new ServerCalls are created on the eventloop instead of the gRPC server thread. When the eventloop runs a callback from the gRPC server, the callback creates a new ServerCall object, and it can run the gRPC handler to completion if the handler does not have any async step. So overall, the eventloop will not have more callbacks in flight than the initial number of ServerCalls, which is 100 in the "unlimited" mode.

The solution is to create a new ServerCall in the gRPC server thread, before sending the ServerCall to the eventloop.
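
To make the failure mode concrete, here is a minimal toy model of the tag replenishment logic, using only standard C++ threads. This is a sketch, not Ray's actual implementation: TagPool stands in for the ServerCall tags in the completion queue, kInitialTags stands in for the 100 initial ServerCalls, and the hypothetical kReplenishOnServerThread flag switches between the buggy behavior (replenish on the eventloop) and the fixed behavior (replenish on the server thread).

#include <atomic>
#include <chrono>
#include <condition_variable>
#include <functional>
#include <iostream>
#include <mutex>
#include <queue>
#include <thread>

// Stands in for the pool of ServerCall "tags" available in the completion queue.
class TagPool {
 public:
  explicit TagPool(int n) : tags_(n) {}
  void Acquire() {  // The server thread consumes a tag to accept one request.
    std::unique_lock<std::mutex> lock(mu_);
    cv_.wait(lock, [this] { return tags_ > 0; });
    --tags_;
  }
  void Release() {  // Creating a new ServerCall replenishes one tag.
    { std::lock_guard<std::mutex> lock(mu_); ++tags_; }
    cv_.notify_one();
  }
 private:
  std::mutex mu_;
  std::condition_variable cv_;
  int tags_;
};

int main() {
  constexpr int kInitialTags = 4;   // stands in for the 100 initial ServerCalls
  constexpr int kRequests = 20;
  constexpr bool kReplenishOnServerThread = false;  // flip to true to model the fix

  TagPool pool(kInitialTags);
  std::queue<std::function<void()>> eventloop_queue;
  std::mutex eq_mu;
  std::condition_variable eq_cv;
  std::atomic<int> accepted{0};

  // "gRPC server thread": needs a free tag to poll each request, then posts a
  // callback to the eventloop.
  std::thread server([&] {
    for (int i = 0; i < kRequests; ++i) {
      pool.Acquire();  // no free tag -> cannot accept another request
      if (kReplenishOnServerThread) pool.Release();  // the fix: replenish here
      ++accepted;
      {
        std::lock_guard<std::mutex> lock(eq_mu);
        eventloop_queue.push([&] {
          if (!kReplenishOnServerThread) pool.Release();  // bug: replenish on eventloop
          std::this_thread::sleep_for(std::chrono::milliseconds(50));  // slow handler
        });
      }
      eq_cv.notify_one();
    }
  });

  // "eventloop": runs callbacks one at a time.
  std::thread loop([&] {
    for (int handled = 0; handled < kRequests; ++handled) {
      std::function<void()> cb;
      {
        std::unique_lock<std::mutex> lock(eq_mu);
        eq_cv.wait(lock, [&] { return !eventloop_queue.empty(); });
        cb = std::move(eventloop_queue.front());
        eventloop_queue.pop();
      }
      std::cout << "accepted: " << accepted << ", handled: " << handled << "\n";
      cb();
    }
  });

  server.join();
  loop.join();
}

With kReplenishOnServerThread = false, the printed "accepted" count never exceeds "handled" plus kInitialTags, mirroring the observed cap of 100; with it set to true, the server thread accepts all kRequests without waiting for the eventloop.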

Running some nightly tests to verify the fix does not introduce instabilities: https://buildkite.com/ray-project/release-tests-branch/builds/652

Also, looking into adding gRPC server / client stress tests with a large number of concurrent requests.

Related issue number

Checks

  • I've run scripts/format.sh to lint the changes in this PR.
  • I've included any doc changes needed for https://docs.ray.io/en/master/.
  • I've made sure the tests are passing. Note that there might be a few flaky tests, see the recent failures at https://flakey-tests.ray.io/
  • Testing Strategy
    • Unit tests
    • Release tests
    • This PR is not tested :(

@mwtian mwtian changed the title [WIP] [Core] fix gRPC handlers' unlimited active calls configuration Jun 9, 2022
@mwtian mwtian marked this pull request as ready for review June 9, 2022 22:30
- int buffer_size = 100;
+ // When there is no max active RPC limit, a call will be added to the completion
+ // queue before processing starts, so adding only 1 call is enough.
+ int buffer_size = 1;
Contributor:

I'm not very familiar with this. But does this mean we'll accept requests more slowly than before? For example, accepting requests one by one, since the size is only 1 and a new call is added only after we do the actual processing.

Member (Author):

The nightly tests do not show a big difference, but from reading the completion queue code, I can see that having a number of tags buffered can make processing more efficient. Adjusted the value to be closer to the existing value.

@fishbone fishbone merged commit dcfed61 into ray-project:master Jun 10, 2022
@mwtian mwtian deleted the grpc-fix branch June 29, 2022 20:47
mwtian added a commit that referenced this pull request Jun 29, 2022
truelegion47 pushed a commit to truelegion47/ray that referenced this pull request Jul 6, 2022
* master: (104 commits)
  [Serve] Java Client API and End to End Tests (ray-project#22726)
  [Docs] Small fix to AIR examples descriptions (ray-project#26227)
  [Deployment Graph] Move `Deployment` creation outside to build function (ray-project#26129)
  [K8s][Ray Operator] Ignore resource requests when detected container resources. (ray-project#26234)
  Revert "[Core] Add retry exception allowlist for user-defined filteri… (ray-project#26289)
  [ci] pin gpustat (ray-project#26311)
  [tune] fix `set_tune_experiment` (ray-project#26298)
  Revert "Revert "[AIR][Serve] Rename ModelWrapperDeployment -> PredictorDeployment"" (ray-project#26231)
  [Release] Use nightly base images for release tests (ray-project#25373)
  Revert "[Core] fix gRPC handlers' unlimited active calls configuration (ray-project#25626)" (ray-project#26202)
  [RLlib] Some Docs fixes (2). (ray-project#26265)
  [C++ worker] Refine worker context and more (ray-project#26281)
  Fix file_system_monitor.cc message (ray-project#26143)
  [Java] Make Java test more stable (ray-project#26282)
  [air] Do not warn of `checkpoint_dir` if it's coming from us (base_trainer). (ray-project#26259)
  [Datasets] Support drop_columns API (ray-project#26200)
  [Datasets] Fix max number of actors for default actor pool strategy (ray-project#26266)
  [ci] Stop syncer staging tests (ray-project#26273)
  [core][gcs] Add storage namespace to redis storage in GCS. (ray-project#25994)
  [workflow] Deprecate workflow.create (ray-project#26106)
  ...